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'0\ ; Abstract 

Q«^ ! The general scheme of two-level parallelization (TLP) for direct simulation Monte Carlo 
of unsteady gas flows on shared memory multiprocessor computers has been described. 

"^ ■ The high efficient algorithm of parallel independent runs is used on the first level. The 

" ■ data parallelization is employed for the second one. 

Two versions of TLP algorithm are elaborated with static and dynamic load balancing. 
The method of dynamic processor reallocation is used for dynamic load balancing. 

W i Two gasdynamic unsteady problems were used to study speedup and efficiency of 

the algorithms. The conditions of efficient application field for algorithms have been 
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1 Introduction 

1.1 Direct Simulation Monte Carlo Method and Sequential 
Algorithm in Unsteady Molecular Gasdynamics 

The Direct Simulation Monte Carlo (DSMC) is the simulation of real gas flows with various 
physical processes by means of huge number of modeling particles |jl|, each of which is a 
typical representative of great number of real gas particles (molecules, atoms, etc.). The 
DSMC method conditionally divides the continuous process of particles movement and 
collisions into two consecutive stages (motion and collision process) at each time step At. 
The particle parameters (coordinates, velocity) are stored in the computer's memory. To 
get information about the flow field the computational domain has to be divided into 
cells. The results of simulation are averaged particles parameters in cells. 




Figure 1: General flowchart of sequential algorithm for DSMC of unsteady flows. At — 
time step, At^ — interval between samples, At^ — total time of a single run, t 
— current time, n — number of runs, i — iteration number. 

The finite memory size and computer performance make restrictions to the total num- 
ber of modeling particles and cells. Macroscopic gas parameters determined by particles 
parameters in cells at the current time step are the result of simulation. Fluctuations of 
averaged gas parameters at single time step can be rather high owing to relatively small 



number of particles in cells. So, when solving steady gasdynamic problems, we have to 
increase the time interval of averaging (the sample size) after steady state is achieved in 
order to reduce statistical error down to the required level. The averaging time step Atav 
has to be much greater than the time step At {Atav ^ At). 

For DSMC of unsteady flows the value of averaging time step Atav for a given problem 
and at the current time t has to meet the following requirement: Atav ^ mint/^(x, y, z, t), 
where tn — is the characteristic time of flow parameters variation. The choice of the 



value of tjj is determined by particular problem |^, |IJ]. In order to meet the condition for 
the averaging interval we have to carry out enough number of statistically independent 
calculations (runs) n to get the required sample size. This leads to the increase of the total 
calculation time which is proportional to n in the case of sequential DSMC algorithm. 

The general flowchart of classic sequential algorithm [|I| is depicted in the fig. |1|. The 
algorithm of DSMC of unsteady flows consists of two basic loops. In the first (inner) 
loop the single run of unsteady process is executed. First, we generate particles at input 
boundaries of the domain (subroutine Generation). Then we carry out simulation of 
particle movement, surface interaction (subroutine Motion) and collision process (subrou- 
tine Interaction) for determined number of time steps At. The sampling (subroutine 
Sampling) of flow macroparameters in cells is carried out at a given moment of unsteady 
process. The inner loop itself is divided into two successive steps. At the first step 
we sequentially carry out simulation for each of Np particles independently. A special 
readdressing array is formed - subroutines Enumeration, Indexing - (it determines the 
mutual correspondence of particles and cells) after the first step. We have to know the 
location of all particles in order to fill that array. At the second step we carry out the 
simulation for each of Nc cells independently. For t > At^ we accumulate statistical data 
of flow parameters in cells. 

The second (outer) loop repeats unsteady runs n times to get the desired sample size. 
Each run is executed independently from the previous ones. To make separate unsteady 
runs independent we have to shift random number generator (RNG). 

For each unsteady run three basic arrays (P, LCR, C) are required. The array P is 
used for storing information about particles. The array LCR is the readdressing array. 
The dimensions of these arrays are proportional to the total number of particles. The 
array C stores information about cells and macroparameters. The dimension of this array 
is proportional to the total number of cells of a computational grid. The DSMC method 
requires several additional arrays which reserve much smaller memory size. The particles 
which abandon the domain are removed from the array P, whereas the new generated 
particles are inserted into the one. Since the particles move from one cell to another we 
have to rearrange the array LCR and update the array C. These procedures are performed 
at each time step At. 



1.2 Parallelization methods for DSMC of gas flows 

The feasibility of parallelization and the efficiency of parallel algorithms are determined 
both by the structure of modeling process and by the architecture and characteristics of 
a computer (number of processors, memory size, etc.). 

The development of any parallel algorithm starts with the decomposition of a general 
problem. The whole task is divided into a series of independent or slightly dependent 
sub-tasks which are solved parallel. For direct simulation of gas flows there are different 
decomposition strategies depending on goals of modeling and flow nature. The develop- 
ment of parallel algorithms for DSMC started not long ago (about 10 years ago). At the 
present time the common classification of principal types of parallel algorithms has not 
been formed yet. However, one can point out several approaches to parallelize the DSMC, 
the efficiency of which is proved by the practice of their usage. Let us conditionally single 
out four types of parallel algorithms of DSMC. 

The first type is the parallelization by coarse-grained independent sub-tasks. This 
method has been realized in [@|-0| for parallelization of DSMC of unsteady problems. 
The algorithm consists in the reiteration of statistically independent modeling procedures 
(runs) of a given fiow on several processors. 

The second type is the spatial decomposition of a computational domain. The cal- 
culation in each of regions are single sub-tasks which are solved parallel. Each processor 
performs calculations for particles and cells in its own region. The transfer of particles 
accompanies with data exchange between processors. Therefore, these sub-tasks are not 
independent. 

This method of parallelization is the most widespread at the present for parallel DSMC 



of both steady and unsteady fiows |]3]-[0I- The main advantage of this approach is the 
reduction of memory size required by each processor. This method can be carried out on 
computers with both local and shared memory. The method has drawback for increasing 
number of processors: the increase of the number of connections between regions and the 
increase of relative amount of data to exchange between regions. The essential condition of 
high efficiency of this method is the ensuring of uniform load balancing and minimization 
of data exchange. One can use static and dynamic load balancing to make good load 
balancing. The modern parallel algorithms of this type usually employ dynamic load 
balancing. 

The third type is the algorithmic decomposition. This type of parallel algorithms 
consists in the execution of different parts of the same procedures on different processors. 
For realization of these algorithms it is necessary to use a computer with architecture 
which is adequate to a given algorithm. The examples of this type of algorithm is the 



data parallelization |T2|, [13 1 



The fourth type is the combined decomposition which includes all types considered 
precedingly. The decomposition of computational domain with data parallelization are 



carried out in |12]. In this paper we shall consider two- level algorithms which include 
methods of first and third type. 



1.3 Algorithm of Parallel Statistically 
Independent Runs (PSIR) 



The statistical independence of single runs make it possible to execute them parallel. The 
general flowchart of the PSIR algorithm is depicted in the fig. ^j. The implementation 
of this approach on a multiprocessor computer leads to the decrease of the number of 
iterations of the outer loop for every single processor [n/p — the number of iterations 
for the p-processor computer). The data exchange between processors goes after all 
calculation are finished. Only one processor sequentially analyzes the results after data 
exchange. The range of efficient application field for this algorithm is p < n. The value 
of n has to be multiply by p to get optimal speedup and efficiency. 
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Figure 2: General flowchart of PSIR algorithm; m — processor ID, p — number of 
processors. 

All arrays (P, LCR, C, etc.) are stored locally for each run. This algorithm can be 
realized on computers with any type of memory (shared or local). The message passing is 
used to perform data exchange on computers with local memory. The scheme of memory 
usage is presented in the flg. ^ The required memory size for this algorithm is proportional 
to p. 



I I local arrays 
■<^ data exchange 



Figure 3: Scheme of memory usage for PSIR algorithm (the case of three processors). 



The speedup Sp and the efficiency Ep of parallel algorithm with a parallel fraction of 
computational work a for the computer with p processors are as follows [IIH]: 



Sp{p,a) = -i. 



(1) 



S 
Ep{p,a) = ^, 
p 



(2) 



where Ti — the execution time of the sequential algorithm, Tp — the execution time of 
a given parallel algorithm on the computer with p processors {p — number of reserved 
processors). In this paper we use a model of computational process which assumes that 
there is some parallel fraction a of total calculations and sequential fraction (1 — a). The 
parallel and sequential calculations are not coincided. 
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Figure 4: Speedup Sp as a function of number of processors p for various parallel fractions 
a (left), speedup Sp as a function of parallel fraction a for various number of 
processors p (right). 



The value of Tp is given by 



Tp = \{\- a) + a/p]Ti. 



(3) 



Figure 5: Efficiency Ep as a function q g 
of parallel fraction a for q. 
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To get the value of a one may use a profiler. The final formulas for Sp and Ep are as 
follows: 



Sp{p, a) 



Ep{p, a) 



p 



p — a{p — 1) 
1 



(4) 



(5) 



^1 — a)p + a 

The formula (^ presents a simple and general function, called the Amdahl law. 
According to this law, the speedup upper limit at p ^ oo for an algorithm, which has 
two non-coinciding parallel and sequential parts, is as follows: 

1 



Spip, a) < 



a 



(6) 



To speed up calculations we have to speed up parallel computations, however, the 
remaining sequential part slows down the overall computing process to more and more 
extent. Even small sequential fraction may reduce greatly the overall performance. 

The figure § shows the speedup Sp as a function of number of processors p and parallel 
fraction a. The efficiency Ep as a functon of a is shown in the fig. |^. Sequential com- 
putations affected speedup and efficiency particularly in the region a > 0.9. Therefore, 
even small decrease of sequential computations in algorithms with high parallel fraction 
makes speedup and efficiency abruptly increase (at relatively high p). 

The PSIR algorithm is coarse-grained and has high efficiency and great degree of 
parallelism comparing to any other parallel algorithm of DSMC of unsteady flows for the 
number of processor p < n. The maximum value of speedup for this algorithm can be 
obtained ai p = n. The potential of speedup which gives the computer is surplus for 
p > n. Thus, the PSIR algorithm for DSMC of unsteady flows has the following range of 



efficient usage: n 3> 1 and n > p. The value of parallel fraction a can be very high (up 
to 0.99 -7- 0.999) for typical problems of molecular gasdynamics [|]. The corresponding 
speedup is 100 ^ 1000. To get the efficiency Ep > 0.5 at n = 100 -^ 1000 it is necessary to 
have p = 100 -f- 1000 respectively. 



1.4 Data Parallelization (DP) of DSMC QT3 



The computing time of each DSMC problem is determined by the inner loop (1) time. The 
duration of this loop depends on the number of particles in the domain and the number 
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Figure 6: General flowchart of DP algorithm; j — index of data element 



of cells. It was stated above that the inner loop consists of two consecutive stages. The 
data inside each stage are independent. The elements P[k] are processed at the ffist 
stage, whereas the elements C [k] — at the second one (the elements of arrays P and C are 
mutually independent). Since the operations on each of these elements are independent it 
is possible to process them parallel. Each processor takes elements from particle array P 
and cell array C according to its unique ID-number, i.e. the m-th processor takes the m- 
th, {m + p)-th, {m + 2p)-th, etc. elements, where "m" is the processor ID-number. This 



rule of particle selection provides good load balancing because various particles require 
different time to process and they are located randomly in the array P . 
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Figure 7: Scheme of memory usage for DP algorithm (three processors) 

The synchronization of processors is performed before the next loop iteration starts. 
Before the second stage begins it is necessary to fill the readdressing array LCR. The 
complete information about the array P is required for readdresing procedure. This task 
can not be parallelized, so it is performed by one processor. There are two synchronization 
points before the readdressing and after the one. The reduction of the computational time 
is due to the decrease of the amount of data which has to be processed by each processor 
{Np/p and Nc/p instead of Np and Nc). After the inner loop is passed the processors also 
need to get synchronized. The figure |^ shows the general flowchart of DP algorithm. 

The data from the array P is required to perform the operations on elements of array C. 
This data is located in the array P randomly. These arrays are stored in the shared 
memory in order to reduce the large data exchange between processors. The memory 
conflicts (several processors read the same array element) are excluded by the algorithm. 
The semaphore technique is used for processors synchronization. The scheme of memory 
usage is depicted in the fig. 0. 

2 Algorithm of Two-Level Parallelization with Static 
Load Balancing 

It was stated above that the potential of the multiprocessor system is surplus for the 
realization of the PSIR algorithm when the required number of statistically independent 
runs n is significantly less than the number of processors p {n <^ p). In this case 
the efficient usage of computer resources of p-processor system can be provided by the 
implementation of an algorithm of two- level parallelization (TLP algorithm). The general 
flowchart of TLP algorithm is shown in the fig. §. The first level of parallelization 
corresponds to the PSIR algorithm, the data parallelization is employed for the second 
level inside each independent run. The TLP algorithm is a parallel algorithm with static 
load balancing. 
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Figure 8: Algorithm of two-level parallelization. At — time step, Atg — interval between 
samples, At^ — time of a single run, t — current time, i — run number (first 
level), pi — number of first level processors, m — second level processor ID- 
number, P2 — number of second level processors, j — index of array element. 
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Figure 9: Scheme of memory usage for TLP algorithm (pi = 2, p2 = 3) 



10 



Fork P2- 1 
processes 



Qnitializatioji) 

"""1^ 

Fork Pi- 1 processes 

— I "^ 

Fork P2- 1 

processes 



X 



Shift RNG 

i 



X 



Synchronization 
Generation 



X 



IVIotion 



I 



Indexing 

i i i 

Synchronization 

Enumeration ] 

Synchronization 

i i T 




i 
Kill 
process 



X 



Kill 
pro cess 



Fork P2- 1 

processes 

I 



Synchronization 



:i: 



Kill 
process 



Kill 
process 



Averaging 



Output 



Figure 10: Flowchart of TLP algorithm 

The scheme of memory usage for TLP algorithm is depicted in the fig. ^. This 
algorithm requires the memory size to be proportional to the number of the first level 
processors which compute single runs (just the same as for the PSIR algorithm). It 
also requires the arrays for each run to be stored in the shared memory as for the data 
parallelization algorithm in order to reduce the data exchange time between processors. 
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The speedup and the efficiency of the TLP algorithm are governed by the following 
equations: 

q _ c c _ Pi P^ (7) 

'' '' pi - a,ip^ - 1) p, - a,ip, - ly ^'^ 

S 
Ep = Ep^ ■ Ep2 = , (8) 

Pl-P2 

where indices '1' and '2' correspond to parameters on the ffist level and on the second 
one. 



The figure ITO shows the detailed flowchart of TLP algorithm for unsteady flow simula- 



tion. There are flve synchronization points in the algorithm. The four of them correspond 
to the DP algorithm. The last synchronization has to be done after termination of all runs. 
The synchronization is employed with the aid of the semaphore technique. In this version 
the iterations of the outer loop (2) are fully distributed between the flrst level processors. 
This algorithm requires n to be multiply by p for uniform distribution of computer 
resources between single runs. In order to make the runs statistically independent we 
have to shift the random number generator in each run. 

The HP/Convex Exemplar SPP-1600 system with 8 processors, 2Gb of memory and 
peak performance 1600 Mflops was used for algorithm test. 

To simulate the conditions of a single user in the system we measured the execution 
time of the parent process which makes the start-up initialization before forking child 
processes and data processing after passing parallel code (this process has the maximum 
execution time). 

The amount of parallel and sequential code was obtained from the program proflling 
data using standard cxpa utility. 

The simulation of unsteady 3-D water vapor flow in the inner atmosphere of a comet 
was carried out in order to study the speedup and the efficiency which yields this algo- 
rithm. The number of the flrst level processors pi was flxed and equal to 6. The number 
of the second level processors p2 was varied from 1 to 6. The value of parallel fraction ai 
and a2 were 0.998 and 0.97 respectively. The flgure |ll| depicts the experimental results 
(circles) and theoretical curves for speedup and efficiency as functions of the total number 
of processors p = pi ■ P2- The same flgure shows the value (marked by cross-sign) of 
speedup and efficiency of the PSIR algorithm (TLP algorithm turns into PSIR algorithm 
at p2 = 1). 

Thus, the TLP algorithm gives the possibility to signiflcantly reduce the computational 
time required for the DSMC of unsteady flows using multiprocessor computers with shared 
memory. The range of the efficient usage of this algorithm is the condition n <C p. 
Moreover, the number of processors p has to be multiply by n in order to provide good 
load balancing. 
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Figure 11: Speedup Sp (top) and efficiency Ep (bottom) of TLP algorithm (circles — 
experiment, curve — formulae (|^||), cross — speedup of PSIR algorithm) 

3 Algorithm of Two-Level Parallelization 
with Dynamic Load Balancing 

The TLP algorithm with static load balancing described in section ^ has several draw- 
backs. It does not provide good load balancing (hence, one may get low efficiency) in the 
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following cases: 

1. the ratio p/pi is not integer (part of processors are not used); 

2. each run has non-parallelized code with total sequential fraction equal to /9*, which 
depends on the starting sequential fraction (3 = 1 — a and the number of proces- 
sors P2: 

At small values of a or large values of p2 some processors may be idle in each run. This 
leads to non-efficient usage of computer resources for high values of pi. 

The increase in efficiency can be obtained by usage of dynamic load balancing with the 
aid of dynamic processor reallocation (DPR). The idea of the algorithm is as follows. Let 
us conditionally divide all available processors into two parts: leading processors pi and 
supporting processors which form the so called "heap" (the number of heap-processors is 
p — pi). Each leading processor is responsible for its own run. This algorithm is similar to 
that of TLP but here there is no hard link of heap-processors with the specific run. Each 
leading processor reserves the required number of heap-processors before starting parallel 
computations (according to a special allocation algorithm). After exiting from parallel 
procedure the leading processor releases allocated heap-processors. This algorithm makes 
it possible to use idle processors more efficiently, in fact this leads to execution of parallel 
code with the aid of more processors than in the case of TLP algorithm with static load 



balancing. The flowchart of TLPDPR algorithm is presented in the flg. |T2. 

The speedup which yields this algorithm is determined by the following basic param- 
eters: the total available number of processors in the system p, the required number of 
independent runs pi = n [pi <C p) , the sequential fraction of computational work in each 
run /3 and the algorithm of heap-processors allocation. In this paper we use the following 
allocation algorithm: 

P2 = (1 + PRI)p2, PRI = O...PRr, (10) 

where pg — the actual number of the second level processors, PRI — the parameter which 
is estimated by experimental results of similar problems, PRI* — the estimated upper 
limit of the efficient range of parameter PRI. 

In case of p being multiply by pi and the value of PRI is equal to 0, this algorithm turns 
into TLP algorithm. The speedup on the second level Sp^ is governed by the following 
equation: 

^P2 = n I 1-/3 • (^1) 

P ^ P2(l+PRI) 

The case when parameter PRI exceeds a threshold leads to the decrease of speedup Sp^. 
This decrease is not governed by (|l5) owing to overstating demands made by allocation 

14 



Initializatiorv, 



Fork p - 1 processes 



Initialize p^ leading processes 



Inintialize HEAP 



Shift RNG^l f Shift RNG] L (P - Pi processes ) 




Kill leading 
processes 



C^ OutpuP > 



Figure 12: Flowchart of TLPDPR algorithm 

algorithm on system resources. As a result, this leads to worse load balancing. The upper 
limit of the efficient range of parameter PRI can be estimated by the following condition: 

, , P-Pi 



:i + PRr)p2. 



(12) 



(1 - P*)Pi 

It means that we have to find such a value of parameter PRI for which there is a uniform 
distribution of all idle processors at a given moment among runs which perform parallel 
computations. The condition for PRI* as a function of (3 and p2 can be derived from (^) 
and (|1|): 



PRF 



l-P 



{P2 



1 . 



(13) 



The expressions discussed precedingly are undoubtedly correct for pi ^ 1. The value of 
Sp^ at PRI = PRI* gives the upper limit of speedup for a given problem. 

To study the characteristics of TLPDPR algorithm we solve the problem on unsteady 
fiow past a body. The value of sequential fraction /5 = 0.437, pi = 6. The speedup as a 
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Figure 13: Speedup on the second level as a function of number of second level processors 
P2 for algorithms of TIP (PRI = 0, dashed line) and TLPDPR (PRI = PRP, 
solid line), circles — experiment (pi = Q, p = 36, PRI = 0), asterisk — optimal 
value of parameter PRI 



function of ^2 {P2 = 1 • • • 6) for PRI = and PRI = PRI* is depicted in the fig. |T3[ The 
same figure shows the results of calculation for PRI = 0, the dot (marked by asterisk) 
corresponds to the optimal value of parameter PRI for P2 = Q {p = 36). The maximum 
speedup Sp^ with a given degree of parallelism (pi — > oo) , which can be estimated by the 
formula ([TT|), comes to 2.3. The TLP algorithm gives the speedup (/3 = 0.437, pi = 6, 
p = 36) which is 80% of the maximum value. At optimum value of parameter PRI the 
TLPDPR algorithm gives 93% for the same case. This is equivalent to the usage of TLP 
algorithm on a 120-processor computer [p = 120, pi = Q, P2 = 20). The figure |1^ shows 
speedups of TLP and TLPDPR algorithms as functions of p2 for various (3. 

The essential question one can raise about TLPDPR algorithm usage is how to de- 
termine the optimal value of parameter PRI apriori. The value given by (jT^) determines 
the upper limit of efficent range of parameter PRI = 1 . . . PRI*. The study of influence 
of parameter PRI on the speedup is presented in the fig. |15| for pi = 6, P2 = G {p = 36). 
The formula (0) gives good approximation of experimental results for the intial range 
of parameter PRI. Further, we see the predicted above decrease of speedup owing to 



16 




10 20 

Processors (p2) 



40 



Figure 14: Speedups of TLP (dashed line) and TLPDPR (solid line) algorithms as 
functions of number of the second level processors p2 for various parallel 
fractions j3 on the second level 

inconsistency of available and required system resources. The latter can be explained in 
the following manner. In (|TT]) it is supposed that released heap processors are allocated 
instantly in the other runs. Actually, these processes are non-coinciding, therefore the 
condition (|Tl|) requires a probability coefficient which is a function of parameters of a 
problem and a computer. This coefficient has to determine the probability to meet 
requirements for system resources while allocating heap processors. 

The great flexibility of this algorithm allows its efficient usage for calculation of both 
steady and unsteady problems. In case of steady-state modeling it is possible to perform 
an additional ensemble averaging for smaller number of modeling particles. This can lead 
to shorter computation time comparing to DP algorithm. The implemented TLPDPR 
algorithm has the following advantages comparing to the TLP algorithm with static load 
balancing: 

• TLPDPR algorithm makes it possible to minimize the latency time of processors. 
It provides better load balancing; 

• Better load balancing make it possible to get higher speedups under the same 
conditions. 
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Figure 15: Speedup on the second level Sp^ as a function of PRI (pi = 6, p = 36, 

PRI = O...PRr). Solid line — theory, dashed line — experiment 
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